The parsed data is stored in the file listed in the parsed_file column and contains two separate tibbles: one with the detailed parsing of the script, and a second that aggregates the text for both scene directions and dialogue.
1.1 Load Sentiments Data
We also want to load the sentiment lexicon data.
Rows: 39
Columns: 6
$ film_title <chr> "A Few Good Men", "Airplane", "Apocalypse Now", "Au…
$ release_year <int> 1991, 1977, 1979, 1997, 1992, 1955, 1974, 1990, 198…
$ genre <chr> "Drama", "Comedy", "War", "Comedy", "Drama", "Weste…
$ title_cleaned <chr> "a_few_good_men", "airplane", "apocalypse_now", "au…
$ parsing_detailed <list> [<tbl_df[8443 x 13]>], [<tbl_df[5646 x 13]>], [<tb…
$ parsing_aggregated <list> [<tbl_df[1794 x 5]>], [<tbl_df[1245 x 5]>], [<tbl_…
2 Initial NLP Processing
We now want to perform some very basic NLP processing such as tokenisation.
Once we have tokenised the script, we also remove “stop words” - that is, common words that do not convey meaning, such as “and”, “to”, “the” and so on.
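A minimal sketch of these two steps using tidytext, assuming the aggregated script text is held in a tibble films_text_tbl with a text column (hypothetical names):

```r
library(dplyr)
library(tidytext)

films_wordtoken_tbl <- films_text_tbl |>
  unnest_tokens(word, text) |>        # one row per word token, lower-cased
  anti_join(stop_words, by = "word")  # drop common stop words
```

The stop_words tibble ships with tidytext and combines several standard stop-word lists.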
plot_stemmed_tbl <- films_stems_tbl |>
  count(word = snowball_stem) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_stemmed_tbl, size = 2, seed = 422)
2.3 Contrasting Dialogue and Direction
Finally, we restrict attention to the words appearing in the lines of dialogue.
plot_dialogue_tbl <- films_wordtoken_tbl |>
  filter(flag_dialogue == TRUE) |>
  count(word) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_dialogue_tbl, size = 2, seed = 422)
3 Sentiment Analysis
Sentiment analysis takes the simple approach of assigning some kind of measure of sentiment or emotion to each word, allowing us to quantify these concepts in the text in various ways.
Note that this approach is simplistic: it does not consider context or anything beyond the presence of each word, but it is a quick and simple thing to look at.
There are a number of different sentiment lexicons available, so we look at a few of them in turn.
3.1 Visualising the NRC Sentiments
We use the NRC sentiments and count the appearance of each emotion in this dataset.
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_nrc_tbl, by = "word") |>
  count(title_cleaned, sentiment)

ggplot(plot_sentiments_tbl) +
  geom_tile(aes(x = title_cleaned %>% str_trunc(20), y = sentiment, fill = n)) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(
    x     = "Film Title",
    y     = "Sentiment",
    fill  = "Raw Count",
    title = "Sentiments in Film Scripts"
  ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5))
Raw counts are interesting, but it is also worth scaling these counts by the total word count of each script and plotting the resulting ratios.
films_wordcount_tbl <- films_wordtoken_tbl |>
  count(title_cleaned, name = "total_count")

plot_sentiments_ratio_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_nrc_tbl, by = "word") |>
  count(title_cleaned, sentiment, name = "word_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(word_ratio = word_count / total_count)

ggplot(plot_sentiments_ratio_tbl) +
  geom_tile(aes(x = title_cleaned %>% str_trunc(20), y = sentiment, fill = word_ratio)) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(
    x     = "Film Title",
    y     = "Sentiment",
    fill  = "Ratio",
    title = "Sentiments in Film Scripts"
  ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8))
3.2 Visualising afinn Sentiments
We now repeat the above exercise using the sentiment words in the afinn data. In this lexicon each word is assigned a positive or negative number indicating its degree of 'positivity'.
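As a sketch of the afinn approach, assuming the lexicon is loaded as sentiments_afinn_tbl (a hypothetical name, by analogy with the NRC and Loughran tibbles used elsewhere) with columns word and value, a net score per film might be computed as:

```r
library(dplyr)

afinn_score_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_afinn_tbl, by = "word") |>  # afinn scores live in `value`
  group_by(title_cleaned) |>
  summarise(net_score = sum(value), .groups = "drop")
```

Summing the signed scores gives a single net-positivity figure per film, which can then be scaled by total word count as before.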
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_loughran_tbl, by = "word") |>
  count(genre, title_cleaned, sentiment, name = "sentiment_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(sentiment_ratio = sentiment_count / total_count)

ggplot(plot_sentiments_tbl) +
  geom_col(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment_ratio, fill = sentiment),
    position = "dodge"
  ) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x     = "Film Title",
    y     = "Sentiment Ratio",
    fill  = "Sentiment",
    title = "Total loughran Sentiment by Film"
  ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))
ggplot(plot_sentiments_tbl) +
  geom_col(aes(x = title_cleaned %>% str_trunc(20), y = sentiment_ratio)) +
  facet_wrap(vars(sentiment), scales = "free_y") +
  labs(
    x     = "Film Title",
    y     = "Sentiment Ratio",
    title = "Facet Plot of Distribution of Sentiment Values"
  ) +
  theme(
    axis.text.x  = element_text(angle = 30, vjust = 0.5, size = 4),
    strip.text.x = element_text(size = 8)
  )
4 Word and Document Frequency
We now look at word usage both within each film and across the corpus as a whole. Because we want to account for differences between stage directions and dialogue, we analyse the word tokens without excluding any stop words.
total_wordfreq_tbl <- films_wordtoken_unstopped_tbl |>
  count(word, sort = TRUE) |>
  mutate(
    freq = n / sum(n),
    rank = row_number()
  )

ggplot(total_wordfreq_tbl) +
  geom_line(aes(x = rank, y = freq)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x     = "Word Rank",
    y     = "Word Frequency",
    title = "Log-Log Plot of Word Frequency vs Ranking"
  )
Overall, we see that word frequency follows a power law, and it is worth exploring how this differs when we segment by individual film.
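As a quick check of what a power law implies: Zipf's law predicts freq proportional to rank^(-s), which is a straight line of slope -s on the log-log scale. A minimal base-R sketch on toy counts (the counts here are illustrative, not from the scripts):

```r
# Toy word counts following an exact Zipf pattern (count = 120 / rank)
counts <- c(120, 60, 40, 30, 24, 20)
freq   <- counts / sum(counts)
rank   <- seq_along(freq)

# Fit a line to the log-log relationship; the slope estimates -s
fit   <- lm(log(freq) ~ log(rank))
slope <- coef(fit)[["log(rank)"]]  # close to -1 for these counts
```

A slope near -1 is the classic Zipf exponent seen in many natural-language corpora.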
total_film_wordfreq_tbl <- films_wordtoken_unstopped_tbl |>
  count(title_cleaned, word, sort = TRUE) |>
  group_by(title_cleaned) |>
  mutate(
    freq = n / sum(n),
    rank = row_number()
  )

ggplot(total_film_wordfreq_tbl) +
  geom_line(aes(x = rank, y = freq, colour = title_cleaned)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x     = "Word Rank",
    y     = "Word Frequency",
    title = "Log-Log Plot of Film Word Frequency vs Ranking"
  ) +
  theme(legend.position = "none")
4.1 Calculate Term Frequency - Inverse Document Frequency
We now want to look at the statistic known as the term frequency - inverse document frequency, or TF-IDF. This statistic calculates the relative frequency of each token within a document and then scales it by the inverse of the proportion of documents in which the token appears.
The effect of this is to show terms that appear frequently in only a subset of the documents - if a token appears in most or all of the documents, it is heavily downweighted by its high document frequency.
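Using the same definitions as tidytext's bind_tf_idf (tf is a token's share of a document's words; idf is the natural log of the number of documents divided by the number of documents containing the token), a small worked example in base R:

```r
# Toy corpus: two "documents" represented as named word-count vectors
doc_a <- c(the = 10, ship = 4, whale = 6)
doc_b <- c(the = 12, ship = 1, harpoon = 3)

# "whale" appears only in doc_a, so it gets a positive idf
tf_whale     <- doc_a[["whale"]] / sum(doc_a)  # 6 / 20 = 0.3
idf_whale    <- log(2 / 1)                     # in 1 of 2 documents
tf_idf_whale <- tf_whale * idf_whale           # about 0.208

# "the" appears in every document, so its idf (and tf-idf) is zero
idf_the <- log(2 / 2)                          # 0
```

This is why ubiquitous words vanish from the TF-IDF rankings regardless of how often they occur.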
In defining what counts as a 'document' here, we start by treating each separate film as a document.
total_film_tfidf_tbl <- films_wordtoken_unstopped_tbl |>
  count(title_cleaned, word, name = "word_count") |>
  bind_tf_idf(word, title_cleaned, word_count)

total_film_tfidf_tbl |>
  slice_max(order_by = tf_idf, n = 500) |>
  mutate(
    tf     = tf     |> round(4),
    idf    = idf    |> round(4),
    tf_idf = tf_idf |> round(4)
  ) |>
  datatable(
    rownames = FALSE,
    caption  = "TF-IDF Statistics for Words"
  )
We see that for each film the highest TF-IDF tokens tend to be character names, so it is worth redoing this analysis using only the dialogue text.
This requires a bit of processing of the data, as we want to collapse all the dialogue for a given character into a single document.
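A minimal sketch of that collapsing step, assuming (hypothetically) a films_dialogue_tbl with character and text columns on the dialogue rows:

```r
library(dplyr)
library(stringr)

# Concatenate every dialogue line for each character within each film,
# producing one "document" per character
character_dialogue_tbl <- films_dialogue_tbl |>
  group_by(title_cleaned, character) |>
  summarise(text = str_c(text, collapse = " "), .groups = "drop")
```

Each row of the result can then be tokenised and fed to bind_tf_idf with the character as the document identifier.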
Despite filtering out the non-dialogue text, we see very similar TF-IDF results. As many of the top tokens are names, this still makes sense: character names are frequently spoken as part of the dialogue.
4.2 Work on Bi-Gram Data
We now repeat this process using bi-grams. We remove any bi-gram in which either of its two words appears on the stop-word list.
films_ngrams_stopped_tbl <- films_ngrams_tbl |>
  drop_na(word) |>
  mutate(token_word = word) |>
  separate(token_word, c("word1", "word2"), sep = " ") |>
  anti_join(stop_words, by = c("word1" = "word")) |>
  anti_join(stop_words, by = c("word2" = "word")) |>
  select(-word1, -word2)

films_ngrams_tfidf_tbl <- films_ngrams_stopped_tbl |>
  count(title_cleaned, word, name = "word_count") |>
  bind_tf_idf(word, title_cleaned, word_count)

films_ngrams_tfidf_tbl |>
  slice_max(order_by = tf_idf, n = 500) |>
  mutate(
    tf     = tf     |> round(4),
    idf    = idf    |> round(4),
    tf_idf = tf_idf |> round(4)
  ) |>
  datatable(
    rownames = FALSE,
    caption  = "Film Dialogue - TF-IDF Words"
  )
4.3 Construct Graph Based on Bi-Grams
An alternative approach to looking at this data is to construct a directed graph of words with the edges being determined by the first and second word of the bi-gram.
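A minimal sketch of this construction with igraph, reusing films_ngrams_stopped_tbl from the previous section (the frequency threshold of 10 is an arbitrary choice for illustration):

```r
library(dplyr)
library(tidyr)
library(igraph)

# Split each bi-gram into its two words, count the distinct pairs,
# and build a directed graph with an edge from word1 to word2
bigram_graph <- films_ngrams_stopped_tbl |>
  separate(word, c("word1", "word2"), sep = " ") |>
  count(word1, word2, sort = TRUE) |>
  filter(n > 10) |>
  graph_from_data_frame(directed = TRUE)
```

The resulting graph can then be visualised with a package such as ggraph, with edge weights taken from the pair counts.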